ABSTRACT


To help job seekers avoid fraudulent online job postings, this paper proposes an automated tool based on natural
language processing (NLP) and machine learning classification techniques. Using the NLP library SpaCy in Python,
we perform semantic analysis, syntactic analysis, and tokenization of the job profile to extract features, and
we use the Random Forest machine learning algorithm to classify a job profile as Real or Fake and report its
accuracy.
1. INTRODUCTION


Employment scams are among the most serious issues
in the recent history of cybercrime. Many
organizations now advertise their job openings
online so that job seekers can find them
conveniently and quickly. However, fraudsters can
exploit this practice by posting fake jobs and
taking money from job seekers. Fake job
advertisements may also be posted in the name of a
reputable company, damaging its credibility. This
highlights the importance of developing an
automated method for detecting fake job postings
and reporting them to the public so that job
seekers do not apply for them.


To detect fake jobs, a machine learning approach
is used that employs several classification
algorithms. A classifier separates the fake job
postings from the larger dataset of job
advertisements.


We apply the Random Forest classifier, a
supervised learning approach, together with the
NLP package SpaCy to the problem of detecting
fraudulent job postings.


2. RELATED WORK


According to various studies, email spam detection
and fake news detection have received a lot of
attention in the field of online fraud detection.


2.1 Email Spam Detection


Large numbers of unwanted emails, classed as spam,
regularly arrive in users' inboxes. This leads to
storage problems and wasted bandwidth. To address
this issue, Gmail, Yahoo Mail, and Outlook have
included spam filters based on neural networks.
Approaches considered for the email spam detection
problem include content-based filtering,
status-based filtering, heuristic-based filtering,
memory- or sample-based filtering, and flexible
spam filtering.


2.2 Fake News Detection


Fake news on social media is characterized by
malicious user accounts and echo chamber effects.
Basic research on fake news detection takes three
perspectives: how fake news is written, how it
spreads, and how users are associated with it.
Features related to news content and social
context are extracted, and a machine learning
model is built to detect misleading stories.


3. PROPOSED METHODOLOGY


The aim of this research is to determine whether a
job posting is fraudulent or not. Identifying and
removing these fake job advertisements will help
job seekers focus only on legitimate positions. To
this end, a dataset from Kaggle is used that
provides information about job postings that may
or may not be suspicious. The dataset has the
schema shown in Fig. 1.


job_id                 int64
title                  object
location               object
department             object
salary_range           object
company_profile        object
description            object
requirements           object
benefits               object
telecommuting          int64
has_company_logo       int64
has_questions          int64
employment_type        object
required_experience    object
required_education     object
industry               object
function               object
fraudulent             int64


Fig 1 Schema Structure of the dataset 


This dataset contains 17,880 job posts. To better
understand the target and establish a baseline, a
multi-step process is followed to obtain a
balanced dataset. Before this data is fed to any
classifier, several pre-processing steps are
applied to it: handling missing values, removing
stop words, deleting irrelevant attributes, and
removing extra whitespace. Fig 2 shows the
architecture diagram.
Fig 2 Architecture Diagram


3.1 Data Preprocessing 


Data preprocessing converts raw data into
well-structured data sets suitable for data mining.
Raw data is frequently incomplete and inconsistently
formatted. How well the data is cleaned has a direct
impact on the effectiveness of any data analysis
activity.
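
The pre-processing steps named in Section 3 (missing values, stop words, irrelevant attributes, extra whitespace) can be sketched in plain Python. This is a minimal illustration, not the pipeline used in this work: the stop-word list, the choice of job_id as the irrelevant attribute, and the helper names clean_text and preprocess are all assumptions made for the example.

```python
import re

# Hypothetical stop-word list for illustration; a real pipeline would use a
# fuller list such as the one shipped with an NLP library.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "for", "is", "we", "are"}

# Assumed irrelevant attribute: an identifier carries no signal about fraud.
IRRELEVANT = {"job_id"}


def clean_text(text):
    """Lower-case the text, collapse extra whitespace, and drop stop words."""
    words = re.sub(r"\s+", " ", text).strip().lower().split(" ")
    return " ".join(w for w in words if w and w not in STOP_WORDS)


def preprocess(record):
    """Return a cleaned copy of one job-post record (a dict of column -> value)."""
    cleaned = {}
    for column, value in record.items():
        if column in IRRELEVANT:
            continue  # irrelevant attribute deletion
        if value is None:
            value = ""  # missing value -> empty string
        cleaned[column] = clean_text(value) if isinstance(value, str) else value
    return cleaned


# Toy record, not taken from the Kaggle data.
post = {
    "job_id": 1,
    "title": "  Data   Entry Clerk ",
    "description": "We are looking for a clerk to work in the office",
    "benefits": None,
    "fraudulent": 0,
}
result = preprocess(post)
print(result)
```

Replacing missing text with an empty string is only one possible policy; dropping the record or the column are alternatives.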


3.1.1 Data Visualization 


Data visualisation is a valuable skill because it allows
us to obtain a qualitative understanding of the data.
This is useful for studying and learning about the data
set, as well as spotting patterns, corrupt data, and
outliers.


Essential relationships can be expressed and
demonstrated with bar graphs, pie charts, and other
forms of data visualisation. Here we show bar graphs
of the jobs available by country (Fig 3) and by
degree (Fig 4).
Fig 3 Country wise job posting


Fig 5 No of Jobs as per experience


3.1.2 Feature Selection


When building a predictive model, feature
selection is the process of reducing the number
of input variables. Limiting the number of input
variables lowers the model's computational cost
and, in some situations, improves the model's
performance.
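
As a small illustration of reducing the number of input variables, the sketch below keeps only the columns that are filled in often enough to be useful. The selection criterion (fraction of missing values), the 0.5 threshold, and the function names are assumptions chosen for the example; the text does not state which criterion was applied.

```python
def missing_fraction(records, column):
    """Fraction of records in which `column` is absent, None, or empty."""
    missing = sum(1 for r in records if r.get(column) in (None, ""))
    return missing / len(records)


def select_features(records, columns, max_missing=0.5):
    """Keep the columns whose missing fraction does not exceed `max_missing`."""
    return [c for c in columns if missing_fraction(records, c) <= max_missing]


# Toy records, not taken from the Kaggle data.
records = [
    {"title": "Clerk", "salary_range": None, "description": "Office work"},
    {"title": "Engineer", "salary_range": None, "description": "Build things"},
    {"title": "Analyst", "salary_range": "40000-50000", "description": ""},
]
selected = select_features(records, ["title", "salary_range", "description"])
print(selected)
```

Here salary_range is missing in two of the three toy records, so it is dropped, while title and description are kept.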


3.2 NLP Preprocessing 


Natural language processing (NLP) is a collective
term for the automatic processing of human
languages. It includes both algorithms that take
human-written text as input and algorithms that
produce natural-looking text as output.


3.2.1 Word Cloud 


A word cloud is a visualization technique for text
data that helps to identify stop words and to pick
out the important words based on their frequency.
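
A word cloud is driven by word frequencies, so the counting step can be sketched without any plotting library. The example below is a minimal sketch; the sample descriptions, the stop-word set, and the function name word_frequencies are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(texts, stop_words=frozenset()):
    """Count how often each word occurs across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


# Toy descriptions, not taken from the Kaggle data.
descriptions = [
    "Work from home and earn money fast",
    "Earn money from home, no experience required",
    "Software engineer with Python experience",
]
freq = word_frequencies(descriptions, stop_words={"and", "from", "with", "no"})
print(freq.most_common(4))
```

In a rendered word cloud, each word would be drawn at a size proportional to its count in freq.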


3.2.2 SpaCy 


SpaCy is a free, open-source library for NLP in
Python. It is written in Cython and is designed
for building information extraction and natural
language understanding systems. In this project we
perform feature extraction through lemmatization,
the process of grouping together the inflected
forms of a word so they can be analyzed as a single
item, identified by the word's lemma, or dictionary
form. Tokenization is used to break a sentence
into separate words, or tokens.
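
To make the two operations concrete, the sketch below tokenizes a sentence and maps each token to a lemma using only the Python standard library. It is a stand-in for illustration and is not SpaCy's API: the regular expression, the tiny LEMMAS table, and the function names are assumptions. In SpaCy itself, the corresponding per-token values are exposed as token.text and token.lemma_.

```python
import re

# Tiny hand-written lemma table, for illustration only; SpaCy derives lemmas
# from its language pipeline rather than from a dict like this.
LEMMAS = {
    "jobs": "job",
    "hiring": "hire",
    "hired": "hire",
    "companies": "company",
    "are": "be",
    "is": "be",
    "was": "be",
}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lemmatize(tokens):
    """Map each token to its lemma, falling back to the lower-cased token."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


tokens = tokenize("Companies are hiring for remote jobs.")
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

After lemmatization, "hiring" and "hired" would both be counted as the single feature "hire", which is the grouping effect described above.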


3.3 Implementation of Classifier 


We use the Random Forest classifier, a meta
estimator that fits a number of decision tree
classifiers on various sub-samples of the dataset
and uses averaging to improve predictive accuracy
and control over-fitting.
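
The idea of fitting many trees on sub-samples and averaging their outputs can be sketched with a deliberately tiny forest. This is an illustrative sketch under simplifying assumptions, not the classifier used for the reported results: each tree is a one-split stump on one randomly chosen feature, the sub-samples are bootstrap samples, and the toy rows (two 0/1 columns standing for has_company_logo and has_questions) are made up.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """Fit a depth-1 decision tree on one feature by minimising training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sub-sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest


def fraction_fake(forest, row):
    """Average of the trees' 0/1 predictions for one row."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


def predict_forest(forest, row):
    return 1 if fraction_fake(forest, row) >= 0.5 else 0


# Toy data: label 1 = fraudulent, 0 = real. Not taken from the Kaggle data.
rows = [(1, 1)] * 5 + [(0, 0)] * 5
labels = [0] * 5 + [1] * 5
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (1, 1)))
```

A full random forest grows deeper trees and considers a random subset of features at every split; the stump version keeps the sub-sampling and averaging steps visible in a few lines.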


The purpose of randomness is to decrease the 
variance of the forest estimator. Indeed, individual 
decision trees typically exhibit high variance and 
tend to overfit. 


The injected randomness in forests yields decision
trees with somewhat decoupled prediction errors.
By taking an average of those predictions, some
errors can cancel out.
Fig 6 Random Forest classifier structure


Random forests achieve a reduced variance by
combining diverse trees, sometimes at the cost of a
slight increase in bias. In practice the variance
reduction is often significant, hence yielding an
overall better model.
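
The variance reduction can be made explicit with a standard identity for the average of correlated predictors. The symbols are assumptions introduced here, not notation from this paper: B is the number of trees, T_b(x) is the prediction of tree b, each tree has variance \sigma^2, and any two trees have correlation \rho.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

As B grows, the second term shrinks toward zero, and the injected randomness lowers \rho, which shrinks the first term. This is why decoupled trees matter: with \rho = 1, averaging would not reduce the variance at all.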


4. Experimental Analysis 


The above-mentioned dataset is used to train and
test the model to find fake job vacancies among
both fake and legitimate posts. The following
table shows the classification report of the
predictions made by the Random Forest algorithm.
The precision for real jobs (class 0) is 97%. The
F1 score is 0.99 for real jobs and 0.58 for fake
jobs.


Classification Report

              precision    recall  f1-score   support

           0       0.97      1.00      0.99      5104
           1       1.00      0.40      0.58       260

    accuracy                           0.97      5364
   macro avg       0.9       0.76      0.78      5364
weighted avg       0.97      0.97      0.97      5364


Fig 7 Classification report of Prediction 
Although this Random Forest classifier obtained an
F1-score close to that of other competitors, it
has shown strong performance on the other
metrics.
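
For readers who want to see how the per-class figures in the classification report are derived, the sketch below computes precision, recall, F1, and accuracy from confusion-matrix counts. The counts are illustrative assumptions, not values reported in this paper: they were chosen only so that the class supports are 5104 and 260 and the rounded scores agree with the report for the fake class.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative counts for the fake class (label 1), NOT taken from the paper.
tp, fp, fn = 105, 0, 155   # fake posts caught, real posts flagged, fake posts missed
tn = 5104                  # real posts correctly predicted as real

p, r, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={accuracy:.3f}")
# prints: precision=1.00 recall=0.40 f1=0.58 accuracy=0.971
```

A precision of 1.00 with a recall of 0.40 for the fake class means that posts flagged as fake are almost always fake, but a majority of fake posts go undetected. Because real posts make up most of the test set, the overall accuracy stays high even so.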
Fig 8 Confusion Matrix


5. Conclusion


Employment scam detection will guide job seekers
to only legitimate offers from companies. To
tackle employment scam detection, several
machine learning algorithms are proposed as
countermeasures in this paper. A supervised
mechanism is used to exemplify the use of several
classifiers for employment scam detection.
Experimental results indicate that the Random
Forest classifier outperforms its peer
classification tools, giving an accuracy of 97.1%.




